-
Notifications
You must be signed in to change notification settings - Fork 741
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PVF: Add Secure Validator Mode #2486
Conversation
- Shutdown the worker when we can’t enable security features that we were able to enable in the host. - Add `secure_validator_mode` to `SecurityStatus`, pass it to workers this way. - Pass security options to worker over the socket instead of CLI args. (Looks much cleaner.) - Refactor `spawn_with_program_path` to only log once - Refactored worker data (see “Worker data” at #1576). This was driving me crazy. - Cleaned up a couple of old TODOs.
- (involved a bit of a refactor) - Move change_root to its own file - Refactor change_root errors
…-mode' into mrcnski/pvf-add-secure-validator-mode
// Landlock is present on relatively recent Linuxes. This is optional if the unshare | ||
// capability is present, providing FS sandboxing a different way. | ||
CannotEnableLandlock(_) => security_status.can_unshare_user_namespace_and_change_root, | ||
// seccomp should be present on all modern Linuxes unless it's been disabled. | ||
CannotEnableSeccomp(_) => false, | ||
CannotUnshareUserNamespaceAndChangeRoot(_) => false, | ||
// Should always be present on modern Linuxes. If not, Landlock also provides FS | ||
// sandboxing, so don't enforce this. | ||
CannotUnshareUserNamespaceAndChangeRoot(_) => security_status.can_enable_landlock, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see why we allow any security features to not be enabled in secure mode.
I view secure mode as an all-or-nothing type thing, with our safest options and recommendations. Otherwise, why bother having a secure mode when we could instead do a best-effort security and just log the errors, without killing the process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what you mean. Our approach has changed a bit over time, but we settled on this for mostly practical reasons. Since we have two FS sandboxing features, a validator can be missing one and still be reasonably safe. On the other hand, we "require" seccomp because it's our only networking sandboxing feature.
You can see in telemetry that a large amount of validators (like at least 1/3) is on Ubuntu 20.04.6 LTS. (A bit better than when I last checked in May, it was more like 50%.) Enforcing 5.13+ would "require" a lot of people to upgrade on a timeline that may not be convenient for them.
Remember, if a validator can't run securely he has to either upgrade his setup (which may be painful), use the insecure flag (which is not ideal), or stay on an old node version (not great either). So, we tried to minimize "breakage". We've also communicated already that only one of the FS features will be "required". We could always switch to enforcing everything, and communicate it properly in advance, but with Godzilla PolkaVM coming it may not be worth it.
A final note. Some security capabilities being present on some machines and missing on others may lead to indeterminism that can be exploited to cause disagreements/disputes/stalls. However, we are so far from determinism anyway that this was, in the end, a low priority of the security work. If we had more time, we could have done that virtualization work while iteratively making the security "requirements" more strict. And I have been putting "require" in quotes because, as Basti has said multiple times, we should not require anything. None of this is in the spec, and validators can always build from source with the barriers removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are so far from determinism anyway
To clarify, if an attacker can execute arbitrary code, there are already many other means by which he can obtain and exploit randomness. We decided that protecting the chain was not feasible with this work, and we have governance to deal with that sort of thing, so we prioritized securing validator machines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can see in telemetry that a large amount of validators (like at least 1/3) is on Ubuntu 20.04.6 LTS
The telemetry page unfortunately show 70% of the kernel versions as "unknown" :(
According to this page, Ubuntu 20.04.5 should have at least kernel version 5.15.
use the insecure flag (which is not ideal)
Well then, what's the purpose of having the insecure flag? The insecure flag should still apply security sandboxing on a best-effort basis, is that right?
Anyway, those were my 2 cents, I won't be against merging it as it is either :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Alin has a point. Even in insecure mode, still enabling in best-effort makes senses. For secure mode: As @mrcnski said, we have multiple layers of restricting filesystem access. If we assume each and every one of them is sufficient by itself, it makes sense to say the validator is secure as long as at least one of them can be enabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got it from here, but I guess that is wrong. 🤷 Someone in the comments there even proved it is wrong. I guess Ubuntu "patch" versions can bump the Linux kernel version? Anyway, that is good news. I can raise a follow-up for enforcing it.
Well that's older I would assume. I assume ubuntu LTS should get updates that include kernel upgrades.
We want validators to be secure because a validator getting compromised is a big deal, but we have the constraint that we can't actually require it. So we just make it inconvenient and scary to run insecurely.
Of course, I was suggesting to encourage them to run with all layers.
If we assume each and every one of them is sufficient by itself, it makes sense to say the validator is secure as long as at least one of them can be enabled.
Agreed, but then why do we have landlock is we think chroot is enough 😛
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, but then why do we have landlock is we think chroot is enough 😛
Indeed, birdcage removed landlock because they deemed it was redundant on top of mount namespaces.
There are a few reasons we have it:
- A lot of this sandboxing work was a research project for me, and I found out about namespaces/pivot_root after landlock. (Still don't fully understand the former tbh.) As I understand from the above discussion, namespaces alone are merely "good enough".
- Landlock provides finer-grained control (didn't end up being useful for us, but gives us more flexibility for the design space in the future).
- Landlock is still expanding its capabilities and will only get more powerful with time.
- I believe it's possible that unshare/pivot_root may not be available on some systems, so it's nice to have a fallback.
- I don't see any downsides to having both, and defense-in-depth has been our philosophy since the beginning stages. It's also the approach advocated by the landlock dev in the above PR:
layering of mount/Landlock definitely works. I'm suggesting to use namespaces, then seccomp (locking down the availability of user namespaces and uncommon syscalls), then Landlock (partial restrictions but could still help with fine-grained access control, if supported by the running system).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it's possible that unshare/pivot_root may not be available on some systems, so it's nice to have a fallback.
I think that's highly unlikely
That being said, sounds fair 👍🏻
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review @alindima!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, birdcage removed landlock because they deemed it was redundant on top of mount namespaces.
I found out that a different project, Emilua, has taken the opposite approach:
Linux namespaces are too convoluted and it became clear how inappropriate they really are to just create sandboxes. Starting from the just newly released version, Landlock and Capsicum were added as alternatives to Linux namespaces.
Their sandboxing docs are really good. It's a bit too late for my work to benefit from them, oh well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks really good!
My only doubt is why we don't enforce the requirement for development mode (running --dev --alice
on an insecure host warns about security but doesn't terminate). I understand it's for a better developer experience. But I ask myself again, how good is it that the DX diverges so much from the UX? We almost never test things in a production context during development. That may result in some undiscovered pitfalls and the wider the gap between DX and UX, the more pitfalls may fit into it. Just a general thought to think about.
Good remark! Especially we should notice when the UX sucks ... The problem is, some developers us MAC and there secure mode does not work - right? |
It's a really good point, but it's not realistic for devs to test everything. I think between the merge window closing and the actual release, a dedicated person/process should be doing actual testing of the node and we should be open to backporting fixes. I didn't mention it before because I didn't want to sound like I was making lame excuses for things like this and this. But like eskimor says, I and many other devs work on a Mac. I do my due diligence by testing on a Linux VPS, but we can't expect all devs to do that (and my VPS account is root so it didn't catch some things.) Thanks for the review! 😊 |
@pepoviola zombienet tests are failing. I don't know how to check it but i guess it's because the new argument is required in CI. Can you help me update zombienet in CI to include the fix? |
yes, we need a new version with the fix we merge in zombienet. Let me create a new one and update the version here. |
@mrcnski this test fails because we are first spawning a node using the |
Thanks @pepoviola! Can we re-generate the |
This require some changes in zombienet, let me try to add the posibility to add/remove flags in the DSL and we can touch base tomorrow. |
Hi @mrcnski, last commit fix the upgrade test. |
* tsv-disabling: (155 commits) Fix failing rc-automation GHA (#2648) [ci] Return CI_IMAGE variable (#2647) Support querying peer reputation (#2392) [ci] Update rust to 1.74 (#2545) Relax approval requirements on CI files (#2564) Added AllSiblingSystemParachains matcher to be used at a parachain level (#2422) Improve polkadot sdk docs (#2631) Bridges subtree update (#2602) pallet-xcm: add new flexible `transfer_assets()` call/extrinsic (#2388) [ci] Move rust-features.sh script to .gitlab folder (#2630) Bump parity-db from 0.4.10 to 0.4.12 (#2635) sp-core: Rename VrfOutput to VrfPreOutput (#2534) chore: fix typo (#2596) Bump tracing-core from 0.1.31 to 0.1.32 (#2618) chore: fixed std wasm build of xcm (#2535) Fix PRdoc that have been previously drafted with older schema (#2623) Github Workflow migrations (#1574) Bridges update subtree (#2625) PVF: Add Secure Validator Mode (#2486) statement-distribution: Add tests for incoming acknowledgements (#2498) ...
* tsv-disabling: (155 commits) Fix failing rc-automation GHA (#2648) [ci] Return CI_IMAGE variable (#2647) Support querying peer reputation (#2392) [ci] Update rust to 1.74 (#2545) Relax approval requirements on CI files (#2564) Added AllSiblingSystemParachains matcher to be used at a parachain level (#2422) Improve polkadot sdk docs (#2631) Bridges subtree update (#2602) pallet-xcm: add new flexible `transfer_assets()` call/extrinsic (#2388) [ci] Move rust-features.sh script to .gitlab folder (#2630) Bump parity-db from 0.4.10 to 0.4.12 (#2635) sp-core: Rename VrfOutput to VrfPreOutput (#2534) chore: fix typo (#2596) Bump tracing-core from 0.1.31 to 0.1.32 (#2618) chore: fixed std wasm build of xcm (#2535) Fix PRdoc that have been previously drafted with older schema (#2623) Github Workflow migrations (#1574) Bridges update subtree (#2625) PVF: Add Secure Validator Mode (#2486) statement-distribution: Add tests for incoming acknowledgements (#2498) ...
Hey @mrcnski. Not sure if this is the right place to share this and thus ask for some guidance. So please pardon me if this is out of line (I'm not a regular github user). After updating our Kusama validator to the latest version (1.7.0 - and now 1.7.1), we're getting errors (see below) similar to those you shared above in the original post. We're running our validator on a bare metal box running Ubuntu 20.04 with the latest kernel (6.7.5). Additionally, we have SECURE BOOT disabled in the BIOS. We thought b/c we were running an older kernel version (5.15.0-88-generic) that updating to the latest version would solve our problem. Alas, it did not. We are able to run our node using the "insecure" flag, but obviously that isn't preferable and we'd like to keep our validator secure and not put the network at risk in any way. Might you (or anyone else) be able to give us some guidance on what we need to do to alleviate our issue? Errors:
|
Off the top of my head I'm not sure what the problem could be. What architecture are you on? Only x86-64 is supported. @s0me0ne-unkn0wn did the security code change in the latest release? Maybe @maksimryndin's refactor broke something? |
@matthewmarcus moved the conversation to #2728 (comment) |
Summary
Running
--validator
without all security features available (with some exceptions) requires the--insecure-validator-i-know-what-i-do
flag.Overview
We've recently added Linux-only security features. This PR makes them "required" to enforce protection on validators (and thus the network), though there is a CLI flag to bypass the restrictions.
Screenshots
If all capabilities are present, or the insecure flag is passed
If some optional capabilities are missing
If some required capabilities are missing
Yeah, this is really ugly, because we return
SubsystemError
fromcandidate-validation
. We could move the check to before subsystem initialization, like we do for the worker version check. But I believe we want subsystems to be in charge of their own initialization and requirements (see #586.)Exceptions
Either one of the
landlock
or theunshare-and-change-root capability
are allowed to be missing if the other is present, because either provides FS security. We decided to be lax about this as Landlock requires Linux 5.13+ and many validators are not yet on this version. See #1444 (comment).Other
In this PR I also did the following. (It might be good to review commit-by-commit.)
secure_validator_mode
toSecurityStatus
, pass it to workers this way.(involved a bit of a refactor)
Some refactors:
spawn_with_program_path
TODO
There was an informative warning that was shown to validators so that they could prepare to meet the new security requirements. However, due to #2426, we announced to validators that they need to ignore the warning for one release. We could show the fixed warning in 1.5 and merge this in 1.6, but we may not have time with on-demand coming. See release calendar.
However, we recently decided that Landlock is optional and most validators are on x86-64. Almost no validators should actually need the insecure flag, so I believe that more warning time is not needed.
Currently it is pretty ugly (see screenshots above under "If some required capabilities are missing"). But I believe we want subsystems to be in charge of their own initialization and requirements (see #586.) We can move the check out of the subsystem, or make that shutdown less ugly maybe...
Related
Closes #1444
Closes #881
Closes #2495
Previous attempt: paritytech/polkadot#7073